After aligning raw RNA sequencing (RNA-seq) reads to the reference genome and counting the number of reads mapped to genomic features, it can be appropriate to perform differential gene expression (DGE) analysis. In this vignette, we will be using the DESeq2 package in R to do this (Love, Huber, & Anders, 2014). edgeR (Robinson, McCarthy, & Smyth, 2009) is another popular package for DGE analysis; it similarly employs the negative binomial distribution to model read counts and adjusts dispersion estimates based on trends across all samples and genes. However, the two methods are not identical and can yield different results. limma (Law et al., 2014) is an older package built on linear models, originally developed to analyze microarray gene expression data. It has been continually updated over the years to work with RNA-seq data, but it is less popular than DESeq2 and edgeR today, so it will not be covered in this document.
The following packages are required to perform the analyses described hereafter:
| Package | Version |
|---|---|
| DESeq2 | 1.28.1 |
| apeglm | 1.10.0 |
| ggplot2 | 3.3.5 |
| ggpubr | 0.4.0 |
| ggtext | 0.1.1 |
| ggrepel | 0.9.1 |
| vsn | 3.56.0 |
| pheatmap | 1.0.12 |
| RColorBrewer | 1.1-2 |
| rstatix | 0.7.0 |
| ggVennDiagram | 1.2.0 |
| UpSetR | 1.4.0 |
| karyoploteR | 1.14.1 |
DESeq2 requires a table of sample information in order to interpret the counts matrix properly and to group sequenced RNA library replicates as we direct. Build this table in Microsoft Excel, Google Sheets, or R to describe each of the samples in the counts matrix.
If the counts matrix looks like the following:
| | SRR0001234_KO1_S01_L001.Counts | SRR0001235_KO1_S01_L002.Counts | SRR0001236_KO2_S02_L001.Counts | SRR0001237_WT1_S03_L001.Counts | SRR0001238_WT2_S04_L001.Counts | SRR0001239_WT2_S04_L002.Counts | SRR0009876_UP1_S05_L001.Counts |
|---|---|---|---|---|---|---|---|
| GENEID1.9 | 0 | 0 | 4 | 1 | 0 | 0 | 0 |
| GENEID2.1 | 29 | 35 | 40 | 48 | 37 | 52 | 15 |
| GENEID3.5 | 21 | 16 | 2 | 1 | 0 | 0 | 42 |
| GENEID4.2 | 2 | 6 | 0 | 1 | 0 | 0 | 11 |
| GENEID5.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| GENEID6.4 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| GENEID7.3 | 0 | 0 | 0 | 0 | 2 | 8 | 12 |
| GENEID8.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Then the table of sample information should look like this:
| | cell_type | condition | sample | run |
|---|---|---|---|---|
| SRR0001234_KO1_S01_L001.Counts | RPMI | KO | KO1 | KO1_L001 |
| SRR0001235_KO1_S01_L002.Counts | RPMI | KO | KO1 | KO1_L002 |
| SRR0001236_KO2_S02_L001.Counts | RPMI | KO | KO2 | KO2_L001 |
| SRR0001237_WT1_S03_L001.Counts | RPMI | WT | WT1 | WT1_L001 |
| SRR0001238_WT2_S04_L001.Counts | RPMI | WT | WT2 | WT2_L001 |
| SRR0001239_WT2_S04_L002.Counts | RPMI | WT | WT2 | WT2_L002 |
| SRR0009876_UP1_S05_L001.Counts | RPMI | UP | UP1 | UP1_L001 |
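If you prefer to build the table in R, a minimal sketch constructing the sample information shown above with base R (names taken from the example counts matrix) is:

```r
# build the sample information table in R; row names must match the
# column names of the counts matrix exactly
samples <- c("SRR0001234_KO1_S01_L001.Counts", "SRR0001235_KO1_S01_L002.Counts",
             "SRR0001236_KO2_S02_L001.Counts", "SRR0001237_WT1_S03_L001.Counts",
             "SRR0001238_WT2_S04_L001.Counts", "SRR0001239_WT2_S04_L002.Counts",
             "SRR0009876_UP1_S05_L001.Counts")
coldata <- data.frame(
  cell_type = "RPMI",
  condition = c("KO", "KO", "KO", "WT", "WT", "WT", "UP"),
  sample    = c("KO1", "KO1", "KO2", "WT1", "WT2", "WT2", "UP1"),
  run       = c("KO1_L001", "KO1_L002", "KO2_L001", "WT1_L001",
                "WT2_L001", "WT2_L002", "UP1_L001"),
  row.names = samples
)
write.csv(coldata, "annotationFile.csv") # optional: save for reuse
```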
If you built the table in a program other than R, save it as a *.csv file in your working directory. Load both of these files from your working directory into R as data objects.
cts <- read.table("countsMatrix.consensus.txt", header=T, sep="\t", row.names="Geneid")
cts
## SRR0001234_KO1_S01_L001.Counts SRR0001235_KO1_S01_L002.Counts SRR0001236_KO2_S02_L001.Counts SRR0001237_WT1_S03_L001.Counts SRR0001238_WT2_S04_L001.Counts SRR0001239_WT2_S04_L002.Counts SRR0009876_UP1_S05_L001.Counts
## GENEID1.9 0 0 4 1 0 0 0
## GENEID2.1 29 35 40 48 37 52 15
## GENEID3.5 21 16 2 1 0 0 42
## GENEID4.2 2 6 0 1 0 0 11
## GENEID5.7 0 0 0 0 0 0 0
## GENEID6.4 0 2 0 0 0 0 0
## GENEID7.3 0 0 0 0 2 8 12
## GENEID8.6 0 0 0 0 0 0 0
coldata <- read.csv(file="annotationFile.csv", sep=",", fileEncoding="UTF-8-BOM", row.names=1)
coldata
## cell_type condition sample run
## SRR0001234_KO1_S01_L001.Counts RPMI KO KO1 KO1_L001
## SRR0001235_KO1_S01_L002.Counts RPMI KO KO1 KO1_L002
## SRR0001236_KO2_S02_L001.Counts RPMI KO KO2 KO2_L001
## SRR0001237_WT1_S03_L001.Counts RPMI WT WT1 WT1_L001
## SRR0001238_WT2_S04_L001.Counts RPMI WT WT2 WT2_L001
## SRR0001239_WT2_S04_L002.Counts RPMI WT WT2 WT2_L002
## SRR0009876_UP1_S05_L001.Counts RPMI UP UP1 UP1_L001
If your input data contains columns (for cts) or rows (for coldata) that are unnecessary for your immediate analysis, exclude them by subsetting.
cts <- cts[,c(1:6)] # excludes column 7 from the original data in this subset
cts
## SRR0001234_KO1_S01_L001.Counts SRR0001235_KO1_S01_L002.Counts SRR0001236_KO2_S02_L001.Counts SRR0001237_WT1_S03_L001.Counts SRR0001238_WT2_S04_L001.Counts SRR0001239_WT2_S04_L002.Counts
## GENEID1.9 0 0 4 1 0 0
## GENEID2.1 29 35 40 48 37 52
## GENEID3.5 21 16 2 1 0 0
## GENEID4.2 2 6 0 1 0 0
## GENEID5.7 0 0 0 0 0 0
## GENEID6.4 0 2 0 0 0 0
## GENEID7.3 0 0 0 0 2 8
## GENEID8.6 0 0 0 0 0 0
coldata <- coldata[c(1:6),] # excludes row 7 from the original data in this subset
coldata
## cell_type condition sample run
## SRR0001234_KO1_S01_L001.Counts RPMI KO KO1 KO1_L001
## SRR0001235_KO1_S01_L002.Counts RPMI KO KO1 KO1_L002
## SRR0001236_KO2_S02_L001.Counts RPMI KO KO2 KO2_L001
## SRR0001237_WT1_S03_L001.Counts RPMI WT WT1 WT1_L001
## SRR0001238_WT2_S04_L001.Counts RPMI WT WT2 WT2_L001
## SRR0001239_WT2_S04_L002.Counts RPMI WT WT2 WT2_L002
Check to make sure all column names of the counts matrix (cts) and all row names of the column data (coldata) match spelling and order perfectly.
all(rownames(coldata) %in% colnames(cts)) # check spelling matches; 'TRUE' is good
all(rownames(coldata) == colnames(cts)) # check order of samples; 'TRUE' is good
If the result of either command is FALSE… STOP! This needs to be addressed before proceeding. Examples of fixes:
# if the suffix ".Counts" is missing in the coldata, append the string to the end of the row names
rownames(coldata) <- paste0(rownames(coldata), ".Counts")
# if the names match but are not in the same order, make the order of cts match that of coldata
cts <- cts[, rownames(coldata)]
Since our count data from featureCounts is in a matrix format, we can use the DESeqDataSetFromMatrix() function to create a DESeqDataSet object of a specified experimental design. Here, ~ condition specifies to test for the effect each condition has on the samples. Alternatively, using ~ cell_type + condition would specify to test for the effect each condition has on the samples while controlling for the effect of the cell type (if they were different). Also, using ~ cell_type + condition + cell_type:condition specifies to test for the difference in effect on genes between conditions across cell types, creating interactions between the terms separated by the colon.
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ condition)
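The alternative designs described above would be constructed the same way; note these are illustrative only — in this vignette's coldata, cell_type has a single level (RPMI), so DESeq2 would reject these designs as not full rank:

```r
# illustrative only: requires a coldata where cell_type varies across samples
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ cell_type + condition) # control for cell type
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ cell_type + condition + cell_type:condition) # with interaction
```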
Note: If you did not encode the vector(s) used in the design formula as factors after loading your coldata, they are likely of the character class, so DESeq2 will convert them to factors.
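To make the factor encoding explicit (and avoid the automatic conversion message), the design variables can be converted before building the DESeqDataSet; a minimal sketch:

```r
# explicitly encode design variables as factors; levels default to
# alphabetical order (the reference level is chosen later with relevel())
coldata$condition <- factor(coldata$condition)
coldata$sample <- factor(coldata$sample)
coldata$run <- factor(coldata$run)
```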
Once the DESeqDataSet has been prepared, technical replicates can be collapsed [if you have them]. Do NOT collapse biological replicates; you will lose statistical power by doing so. The collapseReplicates() function sums the counts of technical replicates (multiple sequencing runs of the same library).
ddsColl <- collapseReplicates(dds, dds$sample, dds$run)
# ensure the dds "sample1" counts sum matches the combined count of ddsColl "sample1"
matchFirstLevel <- dds$sample == dds$sample[1] # use this if sample column is character class
matchFirstLevel <- dds$sample == levels(dds$sample)[1] # use this if sample column is factor class
# output: samples with the same name/level as value 1 all become TRUE
stopifnot(all(rowSums(counts(dds[,matchFirstLevel])) == counts(ddsColl[,1])))
If you receive the error message: Error: all(rowSums(counts(dds[, matchFirstLevel])) == counts(ddsColl[, .... is not TRUE, do not panic! The order of the dds object library names very likely does not match the order of the ddsColl object which took the values of the “sample” vector as their new names after collapsing. Changing the column index number in the last command, counts(ddsColl[,#]), to match the correct column will circumvent this error.
Removing genes with very few reads, considered 'unexpressed' genes, is recommended to speed up the downstream DESeq2 per-gene model fitting (i.e., calculating dispersion and fold change estimates). Additionally, if all samples have zero counts for a gene, DESeq2 will set its baseMean to zero and its log2FoldChange, pvalue, and padj to NA. This information is not useful, so removing these rows by pre-filtering before processing the DESeqDataSet saves computational time. Note that pre-filtering is not necessary for controlling the multiple testing burden because of downstream independent filtering on the mean of normalized counts.
# pre-filter out rows with fewer than 10 reads across all samples
dds <- ddsColl[rowSums(counts(ddsColl)) >= 10,]
# alternatively, keep only rows where at least two samples (smallest group size in data set) have 10 reads or more
dds <- ddsColl[rowSums(counts(ddsColl) >= 10) >= 2,]
To see at a glance how many genes will remain in the data set after filtering, use the following command:
# option 1
table(rowSums(counts(ddsColl)) >= 10)
## FALSE TRUE
## 25047 35562
# option 2
table(rowSums(counts(ddsColl) >= 10) >= 2)
## FALSE TRUE
## 33448 27161
TRUE = the number of genes passing the filter threshold. FALSE = the number of genes filtered out of the data set by the filter threshold.
When constructing the DESeqDataSet earlier, an experimental design was chosen based on some descriptive column(s) in the coldata object. In this vignette, the “condition” column was specified. The “condition” column is of the Factor class and contains two levels: “WT” and “KO”. Before performing the DE analysis, select the level representing the experimental control/reference samples. Then, run the DE analysis to estimate size factors, dispersions, and gene-wise dispersions, as well as compute the mean-dispersion relationship and fit a model to the data.
dds$condition <- relevel(dds$condition, ref = "WT")
dds <- DESeq(dds)
To check which condition comparisons are possible, list the possible coefficients/contrasts.
resultsNames(dds)
## [1] "Intercept" "condition_KO_vs_WT"
Use the results() function to extract the DGE results for a condition comparison of interest from the DESeqDataSet into a DESeqResults object. If the adjusted p-value cutoff will be a value other than 0.1 (10%; the DESeq2 default), set the significance level, alpha, to match the alternative padj cutoff. The wrapped (hidden) genefilter::filtered_p() function uses this significance level to find a 'mean of normalized counts' threshold that optimizes the number of adjusted p-values lower than alpha, in agreement with the Benjamini-Hochberg correction methodology. Any genes not passing the filter threshold will have their padj set to NA. This independent filtering procedure is optimal for power. If you decide to change the false discovery rate threshold, change the alpha value and re-run results() rather than simply applying a new filter to the last generated data set, so that independent filtering is performed for the new threshold.
# default FDR cutoff is 0.1 but 0.05 and 0.01 are more common
res.01 <- results(dds, contrast=c("condition","KO","WT"), alpha=0.01)
# or
res.01 <- results(dds, name="condition_KO_vs_WT", alpha=0.01)
While the contrast and name arguments perform nearly the same task, one notable difference is that contrast will set the estimated log2FoldChange in a comparison of two groups to zero when all counts of both groups are zero. This can be a useful feature and is what I prefer to use.
A basic summary of the DESeqResults can be accessed using the summary() function.
summary(res.01)
## out of 28757 with nonzero total read count
## adjusted p-value < 0.01
## LFC > 0 (up) : 1653, 5.7%
## LFC < 0 (down) : 1244, 4.3%
## outliers [1] : 0, 0%
## low counts [2] : 3903, 14%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
A matrix of the normalized counts per gene can also be extracted if desired.
res.mat <- counts(dds, normalized=TRUE)
DESeqResults object contents:

| Column | Description |
|---|---|
| baseMean | mean of normalized counts (raw counts divided by size factors) across all samples |
| log2FoldChange | effect size estimate based on experimental design; LFC of 1 = FC of 2 and thus ±2X expression compared to the control |
| lfcSE | standard error of the log2FoldChange estimate |
| stat | the Wald statistic (or likelihood ratio test); the log2FoldChange divided by the lfcSE (or the difference in deviance between the reduced model and the full model) |
| pvalue | the probability, under the null hypothesis, of observing an effect at least as extreme as the one observed, computed using the Wald test (or the LRT) |
| padj | Benjamini-Hochberg adjusted p-values correcting for an expected rate of false positive discoveries |
If pvalue = NA, all counts for the gene were zero or an extreme count outlier was detected; no test was applied.
Shrinking the effect size of the log fold change estimates using the apeglm method (2018) is useful for visualizations and ranking of genes. The results() function produces maximum likelihood estimates (MLEs) for the log2FoldChange which can be highly variable for low count genes. The apeglm method produces Bayesian estimates for the log2FoldChange – using the adaptive Student’s t prior shrinkage estimator – which are more appropriate to use when comparing effect sizes. DESeq2 conveniently wrapped the functions of this method into a single function, lfcShrink().
resLFC.01 <- lfcShrink(dds, res=res.01, coef="condition_KO_vs_WT", type="apeglm")
Note: A base of 2 is used for the LFC instead of 10 because it is much more common to have a 2X fold change than it is a 10X fold change between expressed genes. Therefore, a logarithm of base 10 could be used, but it is more appropriate to show the data on a \(log_2\) scale where it is easily interpreted that 1 is doubled expression, 2 is quadrupled expression, etc.
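Converting between the two scales is a one-liner; a quick sketch with illustrative LFC values:

```r
# convert log2 fold changes back to linear fold changes
lfc <- c(1, 2, -1) # example LFC values
2^lfc              # 2, 4, 0.5 -> doubled, quadrupled, halved expression
```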
To rank genes by effect size instead of by padj, order on abs(log2FoldChange) with decreasing=TRUE.
A useful addition to the base listData of the DESeqResults object (baseMean, log2FoldChange, lfcSE, pvalue, padj) is the gene position information and other attributes described in the GTF annotation used during the mapping and counting of the RNA-seq reads. These metadata can be useful for visualizations requiring gene coordinate data and for further filtering the results.
To extract this information from a GENCODE GTF annotation, go back to the Unix/Linux command line:
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
ANNOTATION=/path/to/genome/annotation/file.gtf
OUTPUT=/path/to/desired/output/folder
#######################################
#######################################
# DO NOT EDIT
cat $ANNOTATION | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"a[2]}' | sed 's/gene_id "//' | sed 's/gene_name "//' | sed 's/gene_type "//' | sed 's/[" ]//g' | sed "1i\Geneid\tGeneSymbol\tChromRange\tChromosome\tStart\tEnd\tStrand\tLength\tGeneType" > $OUTPUT/GTFannotExtract.txt
#######################################
Load this file from your working directory into R as a data object.
ano <- read.table("GTFannotExtract.txt", header=T, sep="\t", row.names="Geneid")
ano
## GeneSymbol ChromRange Chromosome Start End Strand Length GeneType
## ENSG00000223972.5 DDX11L1 chr1:11869-14409 chr1 11869 14409 + 2540 transcribed_unprocessed_pseudogene
## ENSG00000227232.5 WASH7P chr1:14404-29570 chr1 14404 29570 - 15166 unprocessed_pseudogene
## ENSG00000278267.1 MIR6859-1 chr1:17369-17436 chr1 17369 17436 - 67 miRNA
## ENSG00000243485.5 MIR1302-2HG chr1:29554-31109 chr1 29554 31109 + 1555 lncRNA
## ...
Append the GTF metadata to the results table by coercing the DESeqResults object to a data.frame object and merge rows between the two tables if and only if the row names match perfectly.
If the recommended input for the ATTRIBUTE variable was maintained during the “Map, Count, and Normalize” section of the RNA-seq Data Analysis Pipeline (ATTRIBUTE=gene_id), the row.names of the results table are stable Ensembl identifiers with version numbers. Without modification, the read.table() function above also sets the row.names of the annotation table to stable Ensembl identifiers with version numbers.
resLFC.01.fin <- merge(as(resLFC.01, "data.frame"), ano, by.x="row.names", by.y="row.names")
resLFC.01.fin <- data.frame(resLFC.01.fin, row.names = 1)
Dispersion estimates from the DE analysis can be plotted using the plotDispEsts() function from DESeq2. This visually demonstrates how the DESeq() function shrinks the gene-wise estimates closer to a fitted model, resulting in final estimates.
plotDispEsts(dds)
A principal components analysis (PCA) projects the samples onto PC1 and PC2, the two orthogonal directions along which the data spread the most and which together explain most (> 50%) of the differences between samples. Because variance is spread across more than two dimensions, the two percentages do not usually sum to 100%, but two components are normally enough to summarize the observed variance. The percent of the total variance explained by each direction is shown in the axis labels.
Prior to plotting the principal components of the data set, the count data should be transformed. The type of transformation applied depends on the structure of the data:
- vst(): variance stabilizing transformation; uses a parametric fit for the dispersion; less sensitive to high count outliers; recommended for medium-to-large datasets (\(n\) > 30)
- rlog(): regularized logarithm; more sensitive to high count outliers; recommended for small datasets (\(n\) < 30) with a range of sequencing depths across samples
- normTransform(): applies \(log_2(\text{count} + 1)\), where the pseudocount of 1 avoids taking the log of zero

The goal of the transformation is to make the data homoskedastic. That is, to make “the expected amount of variance approximately the same across different mean values”, which is not how counts from RNA-seq experiments typically behave; normally, the expected variance increases with the mean. Once the data appears roughly homoskedastic, it can be used to create PCA plots, calculate distances between samples, and for other methods requiring data of this form.
Creating each DESeqTransform object is as simple as:
vsd <- vst(dds, blind=FALSE)
rld <- rlog(dds, blind=FALSE)
ntd <- normTransform(dds)
Visualize the standard deviations of the transformed data for each type of transformation using the vsn package’s meanSdPlot() function.
meanSdPlot(assay(vsd))
meanSdPlot(assay(rld))
meanSdPlot(assay(ntd))
After selecting the transformation that performs best with the data, plot the PCA. For the example used in this vignette, the VST method was most appropriate, so the vsd DESeqTransform object will be used for all plots requiring homoskedastic data. If a different transformation works better for your data, replace the vsd data with whichever you choose.
pcaData <- plotPCA(vsd, intgroup="condition", returnData=TRUE) # choose an interesting group for colour grouping the samples
percentVar <- round(100 * attr(pcaData, "percentVar")) # calculation of total variance
ggplot(pcaData, aes(PC1, PC2, color=condition)) +
geom_point(size=3) +
ggtitle("TITLE \n PCA (after variance stabilizing transformation)") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
labs(colour="Condition") +
coord_fixed() +
theme_classic() +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL), axis.line = element_blank(),
plot.title = element_text(face="bold", colour="black", size=12, hjust=0.5),
axis.title.x = element_text(face="plain", colour="black", size=8),
axis.title.y = element_text(face="plain", colour="black", size=8),
legend.title = element_text(face="plain", colour="black", size=10),
legend.text = element_text(face="plain", colour="black", size=8))
Here, the differences due to the gene knockout (KO) explain almost ALL of the variance between samples. This validates that we should be using an experimental design of ~ condition for the DE analysis.
First, Euclidean distances must be calculated between samples where all genes contribute roughly equally, thus homoskedastic data should be used (see the PCA tab for an explanation about the use of transformations to produce homoskedastic data).
sampleDistMatrix <- as.matrix( dist(t(assay(vsd))) )
rownames(sampleDistMatrix) <- paste(vsd$condition)
colnames(sampleDistMatrix) <- NULL
colours <- colorRampPalette(rev(brewer.pal(9,"Blues")))(255)
pheatmap(sampleDistMatrix,
main="TITLE \n(after variance stabilizing transformation)",
drop_levels=TRUE,
clustering_distance_rows=dist(t(assay(vsd))),
clustering_distance_cols=dist(t(assay(vsd))),
col=colours)
For the purposes of the vignette, an example of a sample distance heatmap for each type of data transformation is shown above. Note that although they all look basically the same, each has a different scale.
ggplot(resLFC.01.fin, aes(x = pvalue)) + geom_histogram(binwidth = 0.025, boundary = 0) +
coord_cartesian(ylim = c(0, 5000)) +
labs(title = "TITLE", x = "p-value", y = "gene count") +
annotate("text", label = paste0("n = ", nrow(resLFC.01.fin)), x = Inf, y = Inf, size = 4, colour = "black", vjust=2, hjust=1.5) +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL), axis.line = element_blank(),
plot.title = element_text(face="bold", colour="black", size=12, hjust=0.5),
axis.title.x = element_text(face="plain", colour="black", size=10),
axis.title.y = element_text(face="plain", colour="black", size=10),
legend.title = element_text(face="plain", colour="black", size=10),
legend.text = element_text(face="plain", colour="black", size=8))
Hypothetical setup: If the RNA of many control samples is sequenced, it is expected that there will be no significant variation of gene expression between them and the expression profile of the whole transcriptome will have a normal or log-normal distribution. If the difference between samples is tested multiple times, the distribution of resulting p-values should be about even. This is the “null p-value distribution”.
In an experiment considering 1000 genes, by chance alone, about 50 p-values are expected to be in each 0.05 bin from 0 to 1, of which there are 20 (\(\frac{1000 \text{ genes}}{20 \text{ bins}} ={}\sim 50 \text{ genes per bin}\)).
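This expectation is easy to simulate: under the null hypothesis, p-values are uniformly distributed, so binning 1000 random uniform values into 20 bins yields roughly 50 per bin. A sketch:

```r
# simulate a null p-value distribution for 1000 genes
set.seed(1)                                  # for reproducibility
p.null <- runif(1000)                        # p-values are uniform under the null
table(cut(p.null, breaks = seq(0, 1, 0.05))) # ~50 p-values per 0.05 bin
```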
When examining real data from some gene expression experiment (e.g., gene knockout transcriptome vs. wildtype transcriptome), it is reasonable to suspect that a subset of genes will be differentially expressed and thus will not fit a null p-value distribution. If the effects of the experimental design on some genes are large enough that they appear as different from the background distribution, the corresponding p-values will be very small.
Continuing with this example, imagine now that ~60 of the 1000 genes are differentially expressed (true positives; TP). These 60 genes could have had p-values from anywhere along the mostly even null distribution above, but now they will appear in the smallest 0 - 0.05 bin. If a p-value threshold (\(p_T\)) of 0.05 is used alone in an attempt to identify the DEGs resulting from real biological variation, the ~60 true positives will be included, but unfortunately so will the other 50 genes that were already present in this bin in the null distribution (false positives; FP). Thus, the \(p_T\) will detect ~110 genes as the total number of observed positives (\(\text{TP}+\text{FP} = N_{obs}\)).
Note: Naturally, for a p-value threshold to be applied across the total number of hypotheses (\(m\)) in a data set, all p-values must first be put in order from smallest to largest. This list becomes a positional index when each value is given a rank (\(i\)) whereby the smallest p-value is given the rank of one (1) and each subsequent test increments by one (1) rank up to \(m\). \[p_{(1)} \le p_{(2)} \le ... \le p_{(m)}\]
If the background distribution is a decent approximation (roughly normal), the expected number of false positives can be estimated for any \(p_T\): \[\begin{align*} E(N_{FP}) &= m \times p_T \\ &= 1000 \times 0.05 \\ &= 50 \end{align*}\] Therefore, about 50 false positives are expected in a data set of 1000 hypotheses/tests, corresponding to genes, when a p-value threshold of 0.05 is set.
To avoid extracting these false positives, a false discovery rate (FDR) correction can be applied to determine which of the genes with a p-value smaller than 0.05 (\(N_{obs}\)) are true positives and which are false positives. The false positive rate is defined as:
\[\begin{align*} q &\ge \frac{E(N_{FP})}{N_{obs}} \\ q &\ge \frac{m \times p_T}{N_{obs}} \end{align*}\]
Now, consider our list of rank-ordered p-values (\(p_{(i)}\)). Rearrange the false positive ratio equation above, substituting \(p_T\) with a p-value and \(N_{obs}\) with the rank of that p-value. Evaluating the rearranged inequality for every p-value in the data set, using an appropriate target FDR \(q\) (e.g., \(q = 0.10\)), generates a stepped FDR threshold for each rank; we can then mathematically determine the largest rank (\(i_{\text{max}}=k\)) at which an observed p-value is less than its rank-specific FDR threshold. This is how the Benjamini-Hochberg linear step-up procedure works.
\[\begin{align*} q &\ge \frac{m \times p_{(i)}}{i} \\ \frac{i}{m} \cdot q &\ge p_{(i)} \\ p_{(i)} &\le \frac{i}{m} \cdot q \end{align*}\]
All genes with a rank up-to-and-including \(k\) are considered significant DEGs (the null hypothesis is rejected). Consider a data set with 17 total hypotheses and a target FDR of 0.05:
| | Observed \(p\)-value | Rank | FDR threshold | \(i/m*q\) |
|---|---|---|---|---|
| GENEID49.5 | 0.000001 | 1 | 0.0029 | 1/17*0.05 |
| GENEID65.7 | 0.000013 | 2 | 0.0058 | 2/17*0.05 |
| GENEID25.9 | 0.000065 | 3 | 0.0088 | 3/17*0.05 |
| GENEID74.8 | 0.000630 | 4 | 0.0117 | 4/17*0.05 |
| GENEID18.4 | 0.000800 | 5 | 0.0147 | 5/17*0.05 |
| GENEID99.2 | 0.001700 | 6 | 0.0176 | 6/17*0.05 |
| GENEID47.3 | 0.003200 | 7 | 0.0205 | 7/17*0.05 |
| GENEID24.6 | 0.006500 | 8 | 0.0235 | 8/17*0.05 |
| GENEID71.1 | 0.014800 | 9 | 0.0264 | 9/17*0.05 |
| GENEID89.5 | 0.049000 | 10 | 0.0294 | 10/17*0.05 |
| GENEID37.7 | 0.094000 | 11 | 0.0323 | 11/17*0.05 |
| GENEID20.9 | 0.110000 | 12 | 0.0352 | 12/17*0.05 |
| GENEID26.8 | 0.150000 | 13 | 0.0382 | 13/17*0.05 |
| GENEID3.4 | 0.240000 | 14 | 0.0411 | 14/17*0.05 |
| GENEID41.2 | 0.450000 | 15 | 0.0441 | 15/17*0.05 |
| GENEID27.3 | 0.560000 | 16 | 0.0470 | 16/17*0.05 |
| GENEID36.6 | 0.870000 | 17 | 0.0500 | 17/17*0.05 |
Based on the target FDR \(q\) used, one can expect about that fraction of the identified positives to actually be false positives (for the table above, 5% of \(k\)). This imprecision is acceptable because this correction method also allows more true positives to be captured than more stringent procedures (e.g., the Bonferroni correction).
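The stepped thresholds in the table above can be reproduced in a few lines, and R's built-in p.adjust(method = "BH") applies the same procedure directly; a sketch using the 17 example p-values:

```r
# Benjamini-Hochberg step-up procedure on the 17 example p-values
p <- c(0.000001, 0.000013, 0.000065, 0.00063, 0.0008, 0.0017, 0.0032,
       0.0065, 0.0148, 0.049, 0.094, 0.11, 0.15, 0.24, 0.45, 0.56, 0.87)
m <- length(p)
q <- 0.05                         # target FDR
i <- rank(p)                      # ranks (p is already sorted here, so i = 1:17)
thresholds <- i / m * q           # stepped FDR threshold per rank
k <- max(which(p <= thresholds))  # largest rank passing its threshold; k = 9
# equivalently, reject where the BH-adjusted p-value falls below q
which(p.adjust(p, method = "BH") < q) # ranks 1 through 9 are significant
```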
When working with data from high-throughput experiments, like RNA-seq data, it is generally poor form to use ‘p-values below a threshold’ as evidence against the null hypothesis. With such a large number of variables, and thus multiple hypotheses, adjusted p-values (also known as q-values) are a more reliable indicator of significance because they have been corrected to account for the probable presence of some expected number of false positives. The key is to choose a target FDR that will remove most false positives and retain most true positives; it is an arbitrary threshold, but one that can be informed by exploring the data through visualizations like the p-value histogram. A fold change or log fold change cutoff can also be used to further refine the final selection of DEGs.
# order the results from smallest to largest adjusted p-value
resOrd.01 <- resLFC.01.fin[order(resLFC.01.fin$padj),]
# or order the results from smallest to largest p-value
resOrd.01 <- resLFC.01.fin[order(resLFC.01.fin$pvalue),]
# subset only results passing some significance criteria
resSig.01 <- subset(resOrd.01, padj < 0.01 & (log2FoldChange > 1 | log2FoldChange < -1))
# export results to file
write.csv(as.data.frame(resOrd.01),
file="KOvsWT_q0.01_resultsAll.csv")
write.csv(as.data.frame(resSig.01),
file="KOvsWT_q0.01,fc2_resultsSig.csv")
DESeq2 has a plotCounts() function that allows for easy plotting of normalized counts for any gene of interest across conditions. By default, this function adds a pseudocount of 0.5 to ALL genes for plotting on a log scale, since 0 cannot be plotted on a log-scale y-axis; if you will not be plotting on a log scale, modify the transform argument to prevent inflating the true normalized count values. Interestingly, the pc argument can override the transform argument; however, both appear to perform the exact same transformation to the data when in effect.
normalized counts: counts scaled by the estimated size factors, so that the differences affecting counts across samples – due to experimental variability (e.g., variable sequencing depth) rather than true biological variation – are minimized (scaled == normalized).
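The scaling can be verified by hand: dividing each column of raw counts by that sample's size factor reproduces counts(dds, normalized=TRUE). A sketch (assuming gene-specific normalization factors are not in use):

```r
# manual normalization: divide each sample's column by its size factor
sf <- sizeFactors(dds)
norm.manual <- t(t(counts(dds)) / sf)
all.equal(norm.manual, counts(dds, normalized = TRUE)) # TRUE
```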
For the example below, the read counts of a beloved long non-coding RNA (lncRNA) gene, CISTR-ACT (often annotated as CISTR), were plotted.
resLFC.01.fin[grep("CISTR", resLFC.01.fin$GeneSymbol),] # determine the ensembl ID for a gene symbol of interest
## baseMean log2FoldChange lfcSE pvalue padj GeneSymbol ChromRange Chromosome Start End Strand Length GeneType
## ENSG00000260492.3 67.00365 -8.288568 1.010297 1.335049e-15 3.913638e-14 CISTR chr12:53746337-53757227 chr12 53746337 53757227 - 10890 lncRNA
plotcountsdf <- plotCounts(dds, gene="ENSG00000260492.3", intgroup="condition", transform = FALSE,
returnData=TRUE) # return normalized counts data frame for ggplot2 plotting
ggplot(plotcountsdf, aes(x=condition, y=count)) +
ggtitle("Normalized *CISTR* expression") +
labs(x = "Condition", y = "Read Count") +
coord_cartesian(ylim = c(0, 220)) +
theme_classic() +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL),
plot.title = ggtext::element_markdown(face="bold", colour="black", size=12, hjust=0.5),
axis.line = element_blank(),
axis.text = ggtext::element_markdown(),
legend.text = ggtext::element_markdown()) +
geom_boxplot(fill='gray', color="black") +
geom_jitter(shape=16, position=position_jitter(0.1)) +
stat_summary(fun=mean, colour="red", geom="text", show.legend = FALSE, vjust=-0.7, aes(label=round(..y.., digits=1)))
To determine whether any of the observed counts differences in a study comparing two groups of samples are statistically significant, a t-test can be applied. Although we have two related conditions, treated (KO) and untreated (WT) samples, a paired t-test is inappropriate because the data points are not matched pairs (e.g., a single mouse’s weight before and after a treatment).
An independent t-test is more appropriate because the data points were collected from two independent groups (e.g., one population of treated cells and another population of untreated cells). But there are two types of independent t-tests: Student’s t-test (when the per-group sample size is the same AND when the variances of groups are equal) and Welch’s t-test (when the per-group sample sizes are different OR when the variances of groups are unequal). To determine which test should be used, compare the size (\(n\)) of the sample sets using data summary stats and check the equality of variances between the two groups using Levene’s test.
# print sample set size, mean, and sd of the data
plotcountsdf %>% group_by(condition) %>% get_summary_stats(count, type="mean_sd")
## # A tibble: 2 x 5
## condition variable n mean sd
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 WT count 3 200. 19.7
## 2 KO count 3 0.814 1.41
# conduct Levene's test to check equality of variances
plotcountsdf %>% levene_test(count ~ condition)
## # A tibble: 1 x 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 1 4 1.15 0.344
Note: Each sample set must have a size 3 ≤ \(n\) ≤ 5000 for it to be possible to check for normality. If \(n\) < 3, it is safest to assume the data is not normal and proceed with Welch’s test. A significant Levene’s test result (p-value < 0.05), indicating unequal variances, would likewise point to Welch’s test. While the statistical test will lack power at smaller sample sizes, it is still possible to perform it on as few as \(n\) = 2 samples per group.
Since the variances of the two groups are approximately equal and each group has the same number of samples, Student’s t-test seems appropriate. However, Student’s t-test is only fair to apply if the data is also (i) free of outliers and (ii) approximately normally distributed (a requirement that matters mainly at small sample sizes, e.g., \(n\) ≲ 20; see the Central Limit Theorem), so it is important to check that these assumptions hold before performing the test. The normality assumption can be checked using the Shapiro-Wilk test.
# identify outliers
plotcountsdf %>% group_by(condition) %>% identify_outliers(count)
## [1] condition count is.outlier is.extreme
## <0 rows> (or 0-length row.names)
# conduct Shapiro-Wilk test to check normality
plotcountsdf %>% group_by(condition) %>% shapiro_test(count)
## # A tibble: 2 x 4
## condition variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 WT count 0.819 0.160
## 2 KO count 0.75 0
No [extreme] outliers were identified, but the Shapiro-Wilk test showed a significant deviation from normality for one of the two sample sets, W = 0.75, p = 0. Ordinarily, this would indicate that a parametric statistical test, like a t-test, would be inappropriate for the data being analyzed. However, this was an expected outcome: two of the three KO condition count values were zero (0), and a sample distribution simply cannot appear normal with only one non-zero value in the entire sample set. Understanding this, and how reads are counted, it is reasonable to proceed as if the data were approximately normal, since the single non-zero normalized count (~2.4) is itself close to zero. Of course, a non-parametric test, like the Mann-Whitney test, could be performed instead, but such tests have less power.
Why is this fine? (click to expand)
“I’m skeptical about being able to continue with a parametric test despite the brief explanation above.” If this is something you are thinking, then you’ve come to the right place! Let’s run a quick simulation on the same data after replacing the zero-count values with small non-zero values and compare the outcomes:
# structure of data above
print(plotcountsdf)
##          count condition
## ko1   2.441798        KO
## ko2   0.000000        KO
## ko3   0.000000        KO
## wt1 190.007094        WT
## wt2 186.717170        WT
## wt3 222.286733        WT
# structure of new data
print(new.plotcountsdf)
##            count condition
## ko1   2.4417975        KO
## ko2   0.0000001        KO
## ko3   0.5262703        KO
## wt1 190.0070943        WT
## wt2 186.7171697        WT
## wt3 222.2867333        WT
# result of Shapiro-Wilk test on new data
new.plotcountsdf %>% group_by(condition) %>% shapiro_test(count)
## # A tibble: 2 x 4
##   condition variable statistic     p
##   <fct>     <chr>        <dbl> <dbl>
## 1 WT        count        0.819 0.160
## 2 KO        count        0.903 0.394
After replacing the two zero-count values of the KO condition samples with very small values (essentially still zero-counted reads), the Shapiro-Wilk test now shows that the sample data is approximately normal (p > 0.05) and thus the null hypothesis of population normality cannot be rejected. It is safe to proceed with the parametric statistical test.
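For comparison, here is what the non-parametric route would look like using base R’s wilcox.test() (the base-stats equivalent of rstatix’s wilcox_test()), run on the six normalized counts printed above:

```r
# Sketch: the Mann-Whitney (Wilcoxon rank-sum) test on the six normalized
# counts shown above, using base R's wilcox.test().
wt <- c(190.007094, 186.717170, 222.286733)
ko <- c(2.441798, 0, 0)
res <- wilcox.test(wt, ko)  # warns: exact p-value unavailable due to the tied zeros
unname(res$statistic)  # W = 9: every WT value exceeds every KO value
res$p.value            # > 0.05 despite the huge difference -- the power cost at n = 3
```

Even with maximal separation of the two groups, a two-sided rank-sum test with \(n\) = 3 per group can never reach p < 0.05, which is why the (justified) parametric test is preferable here.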
Since the assumptions are accepted as true, we can compute Student’s t-test. And because we are only interested in knowing about any difference between the groups and are less concerned about the direction of the difference, a two-tailed test (default) is appropriate.
Note: If we were testing a more specific, directional hypothesis (e.g., we only care if sample set 1 has a higher mean than sample set 2), then a one-tailed test would benefit us more because it has greater statistical power (for a t-test, the one-tailed p-value is half the two-tailed value whenever the observed effect lies in the hypothesized direction). A one-tailed test can be specified with the inclusion of the alternative argument in the t_test() function, supplying either "less" or "greater" depending on the hypothesis being tested.
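The halving of the p-value can be verified directly with base R’s t.test() (a sketch with toy values echoing the structure of this data, not part of the vignette’s pipeline):

```r
# Sketch: for a t-test, the one-tailed p-value is exactly half the two-tailed
# p-value when the observed effect lies in the hypothesized direction.
x <- c(190.0, 186.7, 222.3)  # group hypothesized to have the larger mean
y <- c(2.4, 0.0, 0.0)
two.sided <- t.test(x, y, var.equal = TRUE)
one.sided <- t.test(x, y, var.equal = TRUE, alternative = "greater")
isTRUE(all.equal(one.sided$p.value, two.sided$p.value / 2))  # TRUE
```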
stat.test <- plotcountsdf %>% t_test(count ~ condition, var.equal = TRUE) %>% add_significance()
stat.test
## # A tibble: 1 x 9
## .y. group1 group2 n1 n2 statistic df p p.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 count WT KO 3 3 17.5 4 0.0000629 ****
Note: If the data is better suited to Welch’s t-test than Student’s t-test, simply remove the
var.equal = TRUE argument in the t_test() function. Additionally, the output table can be made more descriptive of the test results by adding the detailed = TRUE argument in the t_test() function.
Now, to determine the effect size, Cohen’s \(d\) can be calculated.
plotcountsdf %>% cohens_d(count ~ condition, var.equal = TRUE)
## # A tibble: 1 x 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 count WT KO 14.3 3 3 large
Note: If Welch’s t-test was performed instead of Student’s t-test, change the
var.equal = TRUE argument in the cohens_d() function to var.equal = FALSE. Also, it is important to understand that for small sample sizes (\(n\) < 50), Cohen’s \(d\) tends to inflate the effect size estimate.
The effect size was determined to be large! But keep in mind that this estimate may be inflated by our small sample size (\(n\) = 6).
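To see where that number comes from, Cohen’s \(d\) with a pooled standard deviation can be computed by hand from the six normalized counts printed earlier (cohens_d_pooled is a small helper written for this sketch, not an rstatix function):

```r
# Sketch: Cohen's d = (mean1 - mean2) / pooled SD, computed by hand.
cohens_d_pooled <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))  # pooled SD
  (mean(x) - mean(y)) / sp
}
wt <- c(190.007094, 186.717170, 222.286733)
ko <- c(2.441798, 0, 0)
round(cohens_d_pooled(wt, ko), 1)  # 14.3, matching the cohens_d() output above
```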
Here is an example of how to report the results obtained from the statistical test above:
“A two-tailed Student’s t-test showed that the difference between the KO and WT expression of CISTR-ACT was statistically significant, t(4) = 17.5, p < 0.0001, d = 14.3”.
Even better, a summary of these results could be added to the box plot:
stat.test.fin <- stat.test %>% add_xy_position(x = "condition")
ggplot(plotcountsdf, aes(x=condition, y=count)) +
ggtitle("Normalized *CISTR* expression") +
labs(x = "Condition", y = "Read Count", subtitle = get_test_label(stat.test.fin, detailed = TRUE)) +
coord_cartesian(ylim = c(0, 220)) +
theme_classic() +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL),
plot.title = ggtext::element_markdown(face="bold", colour="black", size=12, hjust=0.5),
plot.subtitle = element_text(colour="black", size=12, hjust=0.5),
axis.line = element_blank(),
axis.text = ggtext::element_markdown(),
legend.text = ggtext::element_markdown()) +
geom_boxplot(fill='gray', color="black") +
geom_jitter(shape=16, position=position_jitter(0.1)) +
stat_summary(fun=mean, colour="red", geom="text", show.legend = FALSE, vjust=-0.7, aes(label=round(..y.., digits=1))) +
ggpubr::stat_pvalue_manual(stat.test.fin, tip.length = 0)
If your experimental design included more than one condition (e.g., two conditions: gene knockout vs wildtype AND CRISPR activation of the gene vs wildtype), you may be interested in comparing their differential gene expression profiles to identify similarities and differences. A Venn diagram is a great way to present this information when relatively few groups are being compared (in my opinion, 2-4 groups). A great R package for creating aesthetically pleasing Venn diagrams that integrates some of the customization options of the ggplot2 package is ggVennDiagram.
First, the IDs of significant DEGs from each condition of interest need to be compiled, as character vectors, into a list.
listGenes <- list(gene_knockout = row.names(resSig.ko.01),
gene_activate = row.names(resSig.up.05))
Now, the DEGs from each condition are ready to be parsed into common or unique bins for visualization by ggVennDiagram().
ggVennDiagram(listGenes[1:2], category.names = c("gene_knockout","gene_activate"),
set_size = 5, set_color = "black",
edge_lty = "solid", edge_size = 1,
label_alpha = 0, label_percent_digit = 1, label_color = "black", label_size = 4) +
labs(title = "RPMI", subtitle = "KO(FDR<0.01, |LFC|>1), UP(FDR<0.05, |LFC|>0.6)",
fill = "# of Sig. Genes") +
scale_fill_distiller(palette = "Reds", direction = 1) +
scale_color_brewer(palette = "Greys") +
scale_x_continuous(expand = expansion(mult = .2)) +
theme(plot.title = element_text(face="bold", colour="black", size=12, hjust=0.5),
plot.subtitle = element_text(face="plain", colour="black", size=12, hjust=0.5),
legend.title = element_text(face="plain", colour="black", size=10, hjust=0.5),
legend.text = element_text(face="plain", colour="black", size=8),
legend.position = NULL)
Next, the IDs from each bin (unique or common) should be extracted. To do this, we first need to convert the list to a data frame, replace the NA values (an artifact of converting a list of unequal-length vectors to a data frame, which pads the shorter columns with NA), then save to a text file.
DEGlist <- as.data.frame(sapply(listGenes, "length<-", max(lengths(listGenes))))
DEGlist[is.na(DEGlist)] <- ""
write.table(DEGlist, "DEGlist.txt", append = FALSE, sep = "\t", row.names=F, col.names = TRUE, quote=F)
From this file, the gene IDs in each column, corresponding to each condition comparison, can be copied and pasted into the Venn/Euler diagram maker web tool (made by the Center for Plant Systems Biology at Ghent University and the Vlaams Instituut voor Biotechnologie (VIB) in Belgium) to generate another file with the DEGs sorted into their respective Venn bins and formatted as a human-readable table. This text table output should be downloaded to your working directory.
Note: For nicely formatted condition intersection names in the final table, it is recommended that, in the “Provide name for list (optional)” field, you input names WITH a comma at the end of the string. This will nicely separate the condition names. So instead of “knockout activate”, you’ll get “knockout, activate,”, and the trailing comma is stripped by the R code below so the name appears as “knockout, activate”.
This output table, saved as “VennBins.txt” in this example, should be loaded into R as a data frame so per-gene metadata (e.g., gene symbol, gene coordinates, etc.) and results (e.g., log2FoldChange) can be appended.
VennDF <- read.table("VennBins.txt", header=T, sep="\t")
colnames(VennDF) <- c("Conditions", "Total", "EnsemblID")
VennDF$Conditions <- substring(VennDF$Conditions,1, nchar(VennDF$Conditions)-1)
VennDF <- cbind(id = as.integer(rownames(VennDF)), VennDF)
VennDF <- merge(x = VennDF, y = ano[,c(1,3:8)], by.x="EnsemblID", by.y="row.names")
VennDF <- merge(x = VennDF, y = resLFC.ko.01.fin[,2, drop=F], by.x="EnsemblID", by.y="row.names", all.x=T, suffixes="ko")
VennDF <- rename(VennDF, log2FoldChange.ko = log2FoldChange)
VennDF <- merge(x = VennDF, y = resLFC.up.05.fin[,2, drop=F], by.x="EnsemblID", by.y="row.names", all.x=T, suffixes="up")
VennDF <- rename(VennDF, log2FoldChange.up = log2FoldChange)
VennDF <- relocate(VennDF, EnsemblID, .after = Total)
VennDF <- VennDF[order(VennDF$id), ]
VennDF <- VennDF[2:ncol(VennDF)]
write.table(VennDF, "VennBins_wChrPos&LFC.txt", sep="\t", row.names=F, quote=F)
Similar to a Venn diagram, an UpSet plot is a great option for depicting the various intersections of significant DEGs from across multiple conditions, and UpSet plots are much easier to interpret than Venn diagrams when comparing more than 4 groups. A great R package for creating clean UpSet plots that integrates some of the customization options of the ggplot2 package is UpSetR.
First, the IDs of significant DEGs from each condition of interest need to be compiled, as character vectors, into a list.
listGenes <- list(gene_knockout = row.names(resSig.ko.01),
gene_activate = row.names(resSig.up.05),
gene_knockdown = row.names(resSig.kd.01),
gene_txDeplete = row.names(resSig.td.01),
gene_crispDisp = row.names(resSig.cd.05))
Now, the DEGs from each condition are ready to be parsed into common or unique bins for visualization by upset().
upset(fromList(listGenes), nsets = 5, empty.intersections = "on", order.by = "freq",
mainbar.y.label = "# Sig. DE Genes", sets.x.label = "Total # Sig. DE Genes")
Note: Including the argument
empty.intersections = "on" in the code above allows empty intersections to be plotted. However, not all empty intersections will be plotted because the default number of intersections shown is capped for legibility. If it is desirable to show all possible intersections, also include the argument nintersects = NA.
An MA-plot (A.K.A. Bland–Altman plot / difference plot / Tukey mean-difference plot) is meant to help understand whether the measurements between two assays or samples agree. With regard to genomics technologies, like RNA-seq, the ‘M’ scale (a ratio) is usually the log ratio of the fold change estimates and the ‘A’ scale (an average) is usually the mean of the normalized counts. These plots allow us to visualize the differences in gene expression across all genes.
Since the original maximum-likelihood (MLE) log fold change estimates have already been shrunk (yielding Bayesian shrinkage estimates), the variability from the less precise MLE estimates is reduced, and most genes are expected to show no significant change in expression (they will sit close to 0 on the y-axis).
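As a minimal illustration of the two axes, here is a sketch with hypothetical counts for four genes in two conditions (classical MA definitions; DESeq2’s own MA-plots put the mean of normalized counts on the x-axis on a log scale):

```r
# Sketch: the 'M' and 'A' quantities underlying an MA-plot (toy values).
s1 <- c(10, 100, 1000, 5)  # normalized counts, condition 1
s2 <- c(20, 100,  250, 5)  # normalized counts, condition 2
M <- log2(s2 / s1)         # 'M': log2 fold change (y-axis)
A <- 0.5 * log2(s1 * s2)   # 'A': average log2 expression (x-axis)
M  # 1  0 -2  0 -- only genes far from 0 changed substantially
```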
options(ggrepel.max.overlaps = Inf) # globally set the ability for all requested labels to show on plots, regardless of 'too many overlaps' warnings
pp <- ggmaplot(resLFC.01.fin, main = "TITLE", submain = "FDR<0.01, |LFC|>1",
xlab = "Log<sub>2</sub> mean expression", ylab = "Log<sub>2</sub> fold change",
legend.title = NULL, legend = "top",
fdr = 0.01, fc = 2,
size = 2, alpha = 0.5,
palette = c("#B31B21", "#1465AC", "darkgray"), # Up=red, Down=blue, NS=dark gray
genenames = as.vector(resLFC.01.fin$GeneSymbol),
select.top.method = c("padj"), top = 20,
font.label = c(8, "bold", "black"), label.rectangle = FALSE,
font.main = c(12, "bold", "black"),
font.legend = c(8, "bold", "black"),
xlim = NULL, ylim = c(-15, 15), # nlim = c(0, 50)
xticks.by = NULL, yticks.by = NULL, # if nticks.by=5, a tick mark is shown every 5 units
ggtheme = ggplot2::theme_classic())
pp + theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL),
axis.line = element_blank(),
plot.title = element_text(face="bold", colour="black", size=12, hjust=0.5),
plot.subtitle = element_text(face="plain", colour="black", size=12, hjust=0.5),
axis.title.x = ggtext::element_markdown(),
axis.title.y = ggtext::element_markdown()) +
annotate("text", label = paste0("n = ", nrow(resLFC.01.fin)), x = Inf, y = Inf, size = 4, colour = "black", vjust=2, hjust=1.5)
# convert row names to column
input.ko <- cbind(gene=rownames(resLFC.01.fin), resLFC.01.fin)
# prescribe up/down labels to significant DEGs based on previously established criteria
input.ko$sig <- ifelse(input.ko$log2FoldChange > 1 & input.ko$padj < 0.01 & !is.na(input.ko$padj), "up", ifelse(input.ko$log2FoldChange < -1 & input.ko$padj < 0.01 & !is.na(input.ko$padj), "down", "ns"))
input.ko$sig <- factor(input.ko$sig, levels = c("ns", "down", "up"))
input.ko <- input.ko[order(input.ko$sig, input.ko$padj),] # sort by sig and if tie sort by padj
ggplot(input.ko, aes(log2(baseMean +1), log2FoldChange)) +
geom_point(aes(col=sig), alpha = 0.5) +
ggtitle("TITLE", subtitle = "FDR<0.01, |LFC|>1") +
labs(x = "Log2 mean of normalized counts", y = "Log2 fold change") +
scale_color_manual(values=c("ns"="darkgrey", "up"="#B31B21", "down"="#1465AC"),
labels = c(paste0("NS: ", length(grep("ns", input.ko$sig))), # labels are matched positionally to the factor levels: ns, down, up
paste0("Down: ", length(grep("down", input.ko$sig))),
paste0("Up: ", length(grep("up", input.ko$sig))))) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
geom_hline(yintercept=0, col="black", linetype="solid") +
geom_hline(yintercept=c(-1, 1), col="black", linetype="dashed") +
scale_x_continuous(breaks=seq(0, max(log2(input.ko$baseMean + 1)), 2)) +
geom_text_repel(data=head(input.ko[order(input.ko$padj),], 20), aes(label=GeneSymbol),
fontface="italic", size=3, segment.alpha=0.4, min.segment.length=0, max.overlaps=Inf) +
annotate("text", label = paste0("n = ", nrow(input.ko)), x = Inf, y = Inf, size = 4, colour = "black", vjust=2, hjust=1.5) +
theme_classic() +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL), axis.line = element_blank(),
plot.title = element_text(face="bold", colour="black", size=12, hjust=0.5),
plot.subtitle = element_text(face="plain", colour="black", size=12, hjust=0.5),
legend.title = element_blank(),
legend.position = "top", legend.direction = "horizontal",
legend.text = ggtext::element_markdown())
input.ko <- cbind(gene=rownames(resLFC.01.fin), resLFC.01.fin) # convert rownames to column
input.ko$sig <- "ns"
input.ko$sig[!is.na(input.ko$padj) & input.ko$padj < 0.01] <- "qval" # guard against NA padj values, which error in logical subassignment
input.ko$sig[input.ko$log2FoldChange > 1 | input.ko$log2FoldChange < -1] <- "lfc"
input.ko$sig[!is.na(input.ko$padj) & input.ko$padj < 0.01 & (input.ko$log2FoldChange > 1 | input.ko$log2FoldChange < -1)] <- "qval & lfc"
input.ko$sig <- factor(input.ko$sig, levels = c("ns", "lfc", "qval", "qval & lfc"))
input.ko <- input.ko[order(input.ko$sig, input.ko$padj),] # sort by sig and if tie sort by padj
ggplot(input.ko, aes(log2FoldChange, -log10(padj))) +
geom_point(aes(col=sig), alpha = 0.5) +
ggtitle("TITLE", subtitle = "FDR<0.01, |LFC|>1") +
labs(x = bquote(~Log[2]~ 'fold change'), y = bquote(~-Log[10] ~ italic(q))) +
scale_color_manual(values = c("ns"="grey", "lfc"="black", "qval"="dark blue", "qval & lfc"="dark orange"),
labels = c(paste0("NS <br> (", nrow(input.ko[input.ko$sig == "ns",]),")"),
paste0("Log<sub>2</sub> fold change <br> (", nrow(input.ko[input.ko$sig == "lfc",]),")"),
paste0("*q*-value <br> (", nrow(input.ko[input.ko$sig == "qval",]),")"),
paste0("*q*-value & log<sub>2</sub> fold change <br> (", nrow(input.ko[input.ko$sig == "qval & lfc",]),")"))) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
coord_cartesian(xlim = c(-18, 18), ylim = NULL) +
geom_vline(xintercept=c(-1, 1), col="black", linetype="dashed") +
geom_hline(yintercept=-log10(0.01), col="black", linetype="dashed") +
geom_text_repel(data=head(input.ko[order(input.ko$padj),], 20), aes(label=GeneSymbol),
fontface="italic", size=3, segment.alpha=0.4, min.segment.length=0, max.overlaps=Inf) +
annotate("text", label = paste0("n = ", nrow(input.ko)), x = Inf, y = Inf, size = 4, colour = "black", vjust=2, hjust=1.5) +
theme_classic() +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL), axis.line = element_blank(),
plot.title = element_text(face="bold", colour="black", size=12, hjust=0.5),
plot.subtitle = element_text(face="plain", colour="black", size=12, hjust=0.5),
legend.title = element_blank(),
legend.position = "top", legend.direction = "horizontal",
legend.text = ggtext::element_markdown())
Visualizing the expression of genes in a specific genomic neighbourhood relative to some key genomic position (e.g., gene start site) can occasionally reveal interesting regional effects of differential expression. A cis plot is one such way to depict this data. For it to be optimally effective, though, it is best to use a DESeqResults object that includes ALL genes, including unexpressed genes. To generate this data set, perform an iteration of the same DGE analysis described above but without the pre-filtering of unexpressed genes. Essentially:
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ condition)
dds <- collapseReplicates(dds, dds$sample, dds$run) # collapse technical (lane) replicates so the collapsed object is carried forward
dds$condition <- relevel(dds$condition, ref = "WT")
dds <- DESeq(dds)
res.01 <- results(dds, contrast=c("condition","KO","WT"), alpha=0.01)
resLFC.01 <- lfcShrink(dds, res=res.01, coef="condition_KO_vs_WT", type="apeglm")
resLFC.01.fin <- merge(as(resLFC.01, "data.frame"), ano, by.x="row.names", by.y="row.names")
resLFC.01.fin <- data.frame(resLFC.01.fin, row.names = 1)
Then, to easily subset the data to include only those genes which fall within a specified genomic window, the results need to be converted to a format compatible with the GenomicRanges::subsetByOverlaps() function: a GRanges object.
genes <- makeGRangesFromDataFrame(ano, keep.extra.columns=TRUE, seqnames.field="Chromosome", start.field="Start", end.field="End", strand.field="Strand")
genes.ko <- makeGRangesFromDataFrame(resLFC.01.fin, keep.extra.columns=TRUE, seqnames.field="Chromosome", start.field="Start", end.field="End", strand.field="Strand")
Next, if you want to investigate the cis expression landscape around a specific gene, find and store its coordinates (chromosome and 5’ start position) in variables. For this example, let’s suppose we are interested in observing the cis landscape of the long non-coding RNA gene, CISTR-ACT:
GENEchr <- as.character(seqnames(genes))[genes$GeneSymbol=="CISTR"]
GENEstart <- start(ranges(genes))[genes$GeneSymbol=="CISTR"]
Note: If you are not interested in the landscape of a specific gene or if your gene of interest is not yet annotated but you know which region you wish to visualize, these genomic coordinates can be used as input in the
GRanges()function below instead of theGENEchrandGENEstartvariables defined above.
The size of the window around your coordinate of interest (in bp) should now be supplied to the GRanges() function along with your coordinate of interest to ensure the final data subset encompasses only genes within the specified genomic window. The value stored in the window variable should be half the total size of your final window since it indicates how far to look up and downstream of the coordinate. Using a value of 2500000 will result in a 5 Mb window.
window <- 2500000
query <- GRanges(seqnames=GENEchr,
ranges=IRanges(start=GENEstart-window, end=GENEstart+window))
nhbrs <- subsetByOverlaps(genes, query) # the real subset (below) should contain the same number of genes as this reference subset
nhbrs.ko <- subsetByOverlaps(genes.ko, query)
Note: If you do not want a symmetric window around your coordinate of interest, you cannot use the window variable as it is used above. Instead, provide the number of bp you wish to visualize up and downstream of this coordinate directly in the
GRanges()function.
Now, convert the subset GRanges object back to a data.frame object for better compatibility with ggplot() and base R functions.
nhbrs.ko <- as.data.frame(nhbrs.ko)
A ‘distance from coordinate’ column needs to be added to the data frame and all NA values in the log2FoldChange column must be changed to zeros.
nhbrs.ko$dist <- nhbrs.ko$start-GENEstart
nhbrs.ko$log2FoldChange[is.na(nhbrs.ko$log2FoldChange)] <- 0
Establish some labeling system to differentiate significant DEGs, the reference gene (if used), and non-expressed genes.
nhbrs.ko$sig <- "no"
nhbrs.ko$sig[!is.na(nhbrs.ko$padj) & nhbrs.ko$padj < 0.01 & (nhbrs.ko$log2FoldChange > 1 | nhbrs.ko$log2FoldChange < -1)] <- "yes" # guard against NA padj values (unexpressed genes), which error in logical subassignment
nhbrs.ko$sig[nhbrs.ko$baseMean == 0] <- "nd"
nhbrs.ko$sig[nhbrs.ko$GeneSymbol == 'CISTR'] <- "ref"
nhbrs.ko$sig <- factor(nhbrs.ko$sig, levels = c("no", "nd", "yes", "ref")) # left to right, bottom to top layers
nhbrs.ko <- nhbrs.ko[order(nhbrs.ko$sig),]
Finally, generate the plot.
ggplot(nhbrs.ko, aes(x=dist, y=log2FoldChange)) +
ggtitle("TITLE") +
labs(x = "Genomic position (Mb)", y = "log2FoldChange (KO/WT)", colour="Significance", fill="Significance") +
geom_point(aes(colour=sig, fill=sig), shape=21) +
scale_fill_manual(values = c("nd"="white", "no"="grey", "yes"="black", "ref"="dark red"),
labels = c(paste0("NS <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "no",]),")"), # labels are matched positionally to the factor levels: no, nd, yes, ref
paste0("Not Detected <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "nd",]),")"),
paste0("*q*-value & log<sub>2</sub> fold change <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "yes",]),")"),
paste0("Reference <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "ref",]),")"))) +
scale_color_manual(values = c("nd"="black", "no"="grey", "yes"="black", "ref"="dark red"),
labels = c(paste0("NS <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "no",]),")"),
paste0("Not Detected <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "nd",]),")"),
paste0("*q*-value & log<sub>2</sub> fold change <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "yes",]),")"),
paste0("Reference <br> (", nrow(nhbrs.ko[nhbrs.ko$sig == "ref",]),")"))) +
scale_x_continuous(labels = scales::unit_format(accuracy=0.1, unit=NULL, scale=1e-6), expand=c(0,0), sec.axis=dup_axis()) +
coord_cartesian(xlim = c(-window, window), ylim = NULL) + # if you did not specify a symmetric window before, change the values to what you used instead
geom_hline(yintercept=0, linetype=2, col="grey") +
geom_text_repel(data=head(nhbrs.ko[order(nhbrs.ko$padj),], 20), aes(label=GeneSymbol),
fontface = "italic", size=3, segment.alpha=0.4, min.segment.length=0, max.overlaps=Inf) +
annotate("text", label = paste0("n = ", nrow(nhbrs.ko)), x = Inf, y = Inf, size = 4, colour = "black", vjust=2, hjust=1.5) +
theme_classic() +
theme(panel.border = element_rect(colour = "black", fill = NA, size = NULL),
axis.line = element_blank(),
plot.title = ggtext::element_markdown(face="bold", colour="black", size=12, hjust=0.5),
legend.title = element_blank(),
legend.position = "top",
legend.direction = "horizontal",
legend.text = ggtext::element_markdown(),
axis.text = ggtext::element_markdown(),
axis.title.x.top = element_blank(),
axis.text.x.top = element_blank())
Visualizing genomic data as an ideogram can be a super useful tool because it can provide genome-wide context to the data at a glance. To use our data with the karyoploteR package, it must first be converted to a GRanges object.
genes <- makeGRangesFromDataFrame(ano, keep.extra.columns=TRUE, seqnames.field="Chromosome", start.field="Start", end.field="End", strand.field="Strand")
genes.ko <- makeGRangesFromDataFrame(resLFC.01.fin, keep.extra.columns=TRUE, seqnames.field="Chromosome", start.field="Start", end.field="End", strand.field="Strand")
Next, the data must be further prepared for each of the karyoploteR functions.
# remove genes with NA values in padj column
filtered.genes.ko <- genes.ko[!is.na(genes.ko$padj)]
# create a log.padj column
mcols(filtered.genes.ko)$log.padj <- -log10(filtered.genes.ko$padj)
# apply significance cut-offs to your data
sig.genes.ko <- subset(filtered.genes.ko, padj < 0.01 & (log2FoldChange > 1 | log2FoldChange < -1))
# assign binary system of 1=up and 2=down to DEGs
mcols(sig.genes.ko)$binaryLFC <- ifelse(sig.genes.ko$log2FoldChange>0, 1, 2)
# find highest LFC value and round to nearest integer
fc.ymax.ko <- ceiling(max(abs(range(sig.genes.ko$log2FoldChange))))
# find negative LFC max
fc.ymin.ko <- -fc.ymax.ko
Then, the plotting parameters need to be customized for neater spacing of graphical elements. For the examples in this vignette, in which we exclusively use plot type 2, the following custom parameters will be sufficient:
pp <- getDefaultPlotParams(plot.type=2)
pp$data2inmargin <- 60
pp$data1height <- 400
pp$data2height <- 80
Visit the karyoploteR Tutorial and Examples GitHub repo by Bernat Gel and Eduard Serra to learn how to fully customize the plotting parameters of the figure and to see alternative ways to visualize genomic data using gene coordinates.
kp <- plotKaryotype(genome="hg38", ideogram.plotter = NULL, plot.type=2)
kpAddCytobandsAsLine(kp)
kpAddMainTitle(kp, "Human Genome - All Genes", cex=2)
kpPlotRegions(kp, data=genes[strand(genes)=="+"], avoid.overlapping = FALSE, col="deepskyblue", data.panel=1)
kpPlotRegions(kp, data=genes[strand(genes)=="-"], avoid.overlapping = FALSE, col="gold", data.panel=2)
kpAddLabels(kp, "strand +", data.panel=1, cex=0.8, col="#888888")
kpAddLabels(kp, "strand -", data.panel=2, cex=0.8, col="#888888")
To plot all significant differentially regulated genes in genomic space, the DEG subset is used. Prescribing colours to each gene block to represent up (red) and down (blue) regulation is a nice additional layer of information for the viewer.
kp <- plotKaryotype(genome="hg38", plot.type=2, labels.plotter=NULL, plot.params=pp)
kpAddMainTitle(kp, "Significant Differentially Expressed Genes", cex=0.8)
kpAddChromosomeNames(kp, cex=0.3)
kpAddBaseNumbers(kp, cex=0.25)
at <- autotrack(current.track=1, total.tracks=1, margin = 0.2)
kpHeatmap(kp, data.panel=1, data=sig.genes.ko, y=sig.genes.ko$binaryLFC, col=c("#B31B21AA", "#1465ACAA"), r0=at$r0, r1=at$r1) # colours ramp from low to high y, so 1 (up) = red, 2 (down) = blue
kpAddLabels(kp, data.panel=1, r0=at$r0, r1=at$r1, labels="KO log2 Fold Change", cex=0.25, col="#888888", srt=0, pos=1, label.margin=0.04)
kpPlotDensity(kp, data.panel=2, data=genes, window.size=10e5, border="#FFFFFF00", col="#F4BD15")
kpAxis(kp, data.panel=2, side=1, cex=0.2, ymax=kp$latest.plot$computed.values$max.density, label.margin=-5)
kpAddLabels(kp, data.panel=2, labels="gene density", cex=0.25, col="#888888", srt=0, pos=1, label.margin=0.04)